Background

Topic modeling is an unsupervised learning method used to discover “topics” across a collection of text documents. It is useful when exploring large corpora to find clusters of words and/or similarities between documents. It is also used by search engines to match the topics found in a search string with documents and web pages centered around similar topics.

A topic model represents documents in terms of latent topics that reflect the meaning of a collection of documents. A topic can be viewed as a recurring pattern of co-occurring words: a group of words that often occur together. Topic modeling can link words used in the same context and differentiate among uses of words with different meanings. Common methods of topic modeling include the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA). 1

The goal of topic modeling is to identify “hidden concepts” that run through documents by analyzing the words contained within the texts. These topics are abstract in nature, i.e., words which are related to each other form a topic. The basic idea can be thought about like this:

In a corpus, there exist \(M\) documents denoted by \(D_{i}, \;i = 1,2,...,M\) (each of these documents is visible to us)

Each of these documents contains \(N_{i}, \;i = 1,2,...,M\) words denoted by \(w_{ij}, \; i = 1,2,...,M,\; j = 1,2,...,N_{i}\) (each of these words is also visible to us)

The question answered via topic modeling is: Why does document \(i\) contain words \(w_{ij}, \; j = 1,2,...,N_{i}\)?

The answer is that there are latent (unseen) themes or “topics” running through each document, and the words contained in each document are related to those topics.

Therefore, topic modeling can be thought of as a dimensionality reduction technique in which we move from an \(N \times M\) word-space to a much smaller topic-space. This technique can be approached in two different ways:

  1. As a generative model, where the only information relevant to the model is the number of times each word is produced (the bag-of-words assumption)

  2. As a problem of statistical inference

2

Data Preparation

This document walks through various methods for topic modeling based on the phone_user_reviews data set. To accomplish this we’ll use the following packages.

pacman::p_load(tidytext,
               tidyverse,
               quanteda,
               stm,
               topicmodels,
               lsa,
               here,
               DT,
               data.table,
               stringr,
               twitteR,
               httr,
               lubridate,
               ggmap)

With these packages installed and loaded into our workspace, we proceed to ingest the phone_user_reviews data set and prepare it for our analyses. This is done using the code in the chunks below. In this first chunk we locate the directory containing the six (6) .RData files that contain the data, read the data from each file, and bind the rows together into one complete data object called PUR (for phone user reviews).

# Locate the root of our project
root <- here::here()

# From the root, find the directory of files
# and select those with the .RData extension
pur <- list.files(path = file.path(root, "data", "phone_user_reviews"),
                  pattern = "\\.RData$",
                  full.names = TRUE)

# We don't know the name of the object inside of each .RData file
# So, we create a new environment and load the object there
# We know there's only one object in this environment
# So, we don't need to know the object's name
# We just use ls() to list the objects and then get() the first one
# Using get() reads in the object into the global environment
env <- new.env()
load(pur[1], envir = env)
PUR <- data.table::data.table(get(ls(envir = env)[1], envir = env))

# After we're done we destroy the new environment that we created 
rm(env)

# Then repeat the process for the other files
# We make sure to rbind() the rows of each object together 
for(i in 2:length(pur)){
  
  env <- new.env()
  load(pur[i], envir = env)
  PUR_i <- data.table::data.table(get(ls(envir = env)[1], envir = env))
  
  PUR <- rbind(PUR, PUR_i)
  rm(env)
  
}

In this next chunk we subset the full data set that we created above to consider only those reviews written in English and posted from the US. Then we further subset this object to consider only those reviews whose product title contains the words “samsung” and “edge”, implying the review is about a Samsung-branded device with “edge” in the name. Finally, we reformat the date column so that we can select only those reviews that were posted on or after January 1st, 2015 (the selected date is arbitrary).

PUR_en <- PUR[lang == "en" & country == "us"]

is_samsung <- stringr::str_detect(tolower(PUR_en$product), "samsung")

is_edge <- stringr::str_detect(tolower(PUR_en$product), "edge")

PUR_en_edge <- PUR_en[is_samsung & is_edge,]

PUR_en_edge[,date := as.Date(date, format = "%d/%m/%Y")]

PUR_en_edge_2015 <- PUR_en_edge[date >= as.Date("1/1/2015", format = "%d/%m/%Y")]

DT::datatable(head(PUR_en_edge_2015, 100))

Next, we ingest the actual review text, stored in the extract column, as a quanteda corpus-class object using corpus().

PUR_corpus <- quanteda::corpus(PUR_en_edge_2015$extract)

Finally, we recognize that there are other columns in the data set that provide additional context for each review. We can store these as “document variables” using docvars(), as shown below.

quanteda::docvars(PUR_corpus, "date")    <- PUR_en_edge_2015$date
quanteda::docvars(PUR_corpus, "score")   <- PUR_en_edge_2015$score
quanteda::docvars(PUR_corpus, "source")  <- PUR_en_edge_2015$source
quanteda::docvars(PUR_corpus, "domain")  <- PUR_en_edge_2015$domain
quanteda::docvars(PUR_corpus, "product") <- PUR_en_edge_2015$product
quanteda::docvars(PUR_corpus, "country") <- PUR_en_edge_2015$country

DT::datatable(summary(PUR_corpus))

Vector Space Model (VSM)

The vector space model is a typical solution for keyword search. It represents each document as a vector where each entry corresponds to a different word and the value at that entry corresponds to some measure of how significant that word is in the document (example measures include the number of times the word is present in the document or the term frequency-inverse document frequency value associated with the word). (3)

Model Overview

This model works by comparing vectors, each of which represents an entire document or query. Thus, if the corpus vocabulary contains \(t\) distinct terms, then the information within document \(D_{i}\) can be represented by the following length-\(t\) vector (4):

\[ D_i = (d_{i,1}, d_{i,2}, ... , d_{i,t}) \]

(5)

This is the document-term vector for document \(i\). If we join the document-term vectors for every document in our corpus we have the document-term matrix. For each element in our document-term matrix we typically like to work with the tf-idf weighting measure that we have already discussed in class to reflect the relative importance of each word (6). Recall that tf, term frequency, is simply the number of times each term appears in a document and that idf, inverse document frequency, concerns the number of documents in which a term appears. The idf value is calculated as follows, where \(N\) is the total number of documents and \(n_{t}\) is the number of documents containing term \(t\).

\[ \mbox{idf}(t,D) = \log\left(\frac{N}{n_{t}}\right) \]

The tf-idf value is then calculated as follows.

\[ \mbox{tf-idf}(t,d,D) = \mbox{tf}(t,d) \cdot \mbox{idf}(t,D) \]

Words with high tf-idf values are used often in a subset of the documents in the corpus, but not very often throughout all the documents. We can then use methods like cosine similarity to compare documents or queries (7).

\[ \cos(q,d) = \frac{q \cdot d}{\|q\| \, \|d\|} \]

We then rank the documents according to their similarity to the query. The cosine similarity concept can also be shown visually.

Cosine Similarity (8)
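The cosine similarity formula above can be wrapped in a small helper function (a sketch; the worked examples below compute the same quantity inline):

```r
# Cosine similarity between two numeric vectors:
# the dot product divided by the product of the vector norms
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_sim(c(1, 0, 1), c(1, 1, 0))  # 0.5 for these toy vectors
```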

A Simple Example

Consider this example from the University of Ottawa which considers three documents - each containing three words.

  1. “new york times”
  2. “new york post”
  3. “los angeles times”

The tf-idf values are calculated according to the formula above. Below is an example calculation for “new.” The word appears in two documents, once in the first and once in the second. This is a simple example because the term frequencies per document are all either 1 or 0, so the tf-idf values only depend on idf. Below are the calculations.

new     <- log(3/2)
york    <- log(3/2)
times   <- log(3/2)
post    <- log(3/1)
los     <- log(3/1)
angeles <- log(3/1)

The tf-idf matrix is then a collection of vectors representing each document.

\[ \text{document = (new, york, times, post, los, angeles)} \]

(doc1 <- c(new*1, york*1, times*1, post*0, los*0, angeles*0))
[1] 0.4054651 0.4054651 0.4054651 0.0000000 0.0000000 0.0000000
(doc2 <- c(new*1, york*1, times*0, post*1, los*0, angeles*0))
[1] 0.4054651 0.4054651 0.0000000 1.0986123 0.0000000 0.0000000
(doc3 <- c(new*0, york*0, times*1, post*0, los*1, angeles*1))
[1] 0.0000000 0.0000000 0.4054651 0.0000000 1.0986123 1.0986123

How similar are the documents?

(doc1 %*% doc2) / (sqrt(sum(doc1 ^ 2)) * sqrt(sum(doc2 ^ 2)))
          [,1]
[1,] 0.3778002
(doc1 %*% doc3) / (sqrt(sum(doc3 ^ 2)) * sqrt(sum(doc1 ^ 2)))
          [,1]
[1,] 0.1457895
(doc2 %*% doc3) / (sqrt(sum(doc2 ^ 2)) * sqrt(sum(doc3 ^ 2)))
     [,1]
[1,]    0

Which document is most similar to the query “new new times”? To calculate tf-idf for a query, divide each word’s frequency in the query by the maximum frequency of any word in the query, then multiply by the word’s idf.

(query = c(new * (2/2), york * 0, times * (1 / 2), post * 0, los * 0, angeles * 0))
[1] 0.4054651 0.0000000 0.2027326 0.0000000 0.0000000 0.0000000
(doc1 %*% query) / (sqrt(sum(doc1 ^ 2)) * sqrt(sum(query ^ 2)))
          [,1]
[1,] 0.7745967
(doc2 %*% query) / (sqrt(sum(doc2 ^ 2)) * sqrt(sum(query ^ 2)))
          [,1]
[1,] 0.2926428
(doc3 %*% query) / (sqrt(sum(doc3 ^ 2)) * sqrt(sum(query ^ 2)))
         [,1]
[1,] 0.112928

As expected, document 1 would be the first result when searching this query.
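The same weights can be reproduced with the tidytext package loaded earlier. Note that bind_tf_idf() normalizes tf by document length, so each tf-idf value here is one third of the corresponding hand-computed value above:

```r
library(dplyr)
library(tidytext)

docs <- tibble::tibble(
  doc  = c(1, 2, 3),
  text = c("new york times", "new york post", "los angeles times")
)

(tfidf <- docs %>%
   unnest_tokens(word, text) %>%   # one row per word occurrence
   count(doc, word) %>%            # term counts per document
   bind_tf_idf(word, doc, n))      # adds tf, idf, and tf_idf columns
```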

Latent Semantic Indexing (LSI)/Latent Semantic Analysis (LSA)

(9)

Latent Semantic Analysis (LSA) is an approach to automatic indexing and information retrieval that attempts to overcome some problems with VSM by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the (high-dimensional) vector space representation of documents based on term frequencies as a starting point and applies a dimension reducing linear projection. The specific form of this mapping is determined by a given document collection and is based on a singular value decomposition (SVD) of the corresponding document-term matrix. The general claim is that similarities between documents or between documents and queries can be more reliably estimated in the reduced latent space representation than in the original representation. The rationale is that documents which share frequently co-occurring terms will have a similar representation in the latent space, even if they have no terms in common. LSA thus performs some sort of noise reduction and has the potential benefit to detect synonyms as well as words that refer to the same topic. In many applications this has proven to result in more robust word processing. (10)

LSI (Latent Semantic Indexing) is a way that search engines determine whether your content is really on-topic and in-depth or just spam. The search engines determine this by looking at the words in an article and deciding how relevant they are to each other.

An example is a web search of “windows.” If you are searching for “windows”, there are hundreds of related keywords that you can think of:

“Bill Gates”

“Microsoft”

“Windows 10”

“Surface tablet”

These keywords are naturally grouped together and rightly so as these are the potential LSI keywords when writing a post about “windows.” LSI also helps to differentiate from the “other” windows:

“Window cleaning”

“Double glazed windows”

“Wooden windows”

“Window locks” (11)

Probabilistic Latent Semantic Indexing (pLSI/pLSA)

Anchor Sequences (12)

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by singular value decomposition, the probabilistic variant has a statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous. (13)

Model

In LSA, we use truncated singular value decomposition (SVD). SVD is a linear algebra technique that factorizes a matrix into three separate matrices. This method learns about the latent topics in the documents by performing matrix decomposition on the document-term matrix. It is intuitive that this matrix is very sparse and noisy, so we need to reduce dimensionality in order to find the relationships between words and documents. Like VSM, some people prefer to use the tf-idf values in the matrix. The formula for truncated SVD is as follows (14).

\[ A = U_tS_tV_t^T \]

One way to think about this process is that we are keeping the \(t\) most important dimensions, where \(t\) is a number we choose ahead of time based on how many topics we want to extract.

LSA (15)

The \(U\) matrix is in the term space and the \(V\) matrix is in the document space. The columns of each correspond to our topics, so if \(t\) is two, we keep two columns of each. With these matrices, we can then apply cosine similarity or other measures.
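Since the lsa package is already loaded, the truncated decomposition can also be obtained directly with lsa::lsa(), which returns the truncated term matrix (tk), singular values (sk), and document matrix (dk). A minimal sketch on a toy 4-term, 3-document matrix:

```r
# Toy term-document matrix: 4 terms (rows) by 3 documents (columns)
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1),
            nrow = 4, byrow = TRUE)

# Keep 2 dimensions (topics)
res <- lsa::lsa(m, dims = 2)

dim(res$tk)  # 4 x 2: terms in the reduced topic space
dim(res$dk)  # 3 x 2: documents in the reduced topic space
```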

A Simple Example

Consider the following documents (16).

  1. “Shipment of gold damaged in a fire”
  2. “Delivery of silver arrived in a silver truck”
  3. “Shipment of gold arrived in a truck”
DTM (17)
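The chunk below takes the SVD of the document-term matrix a pictured above. To make it reproducible, a can be entered by hand, with one row per term (alphabetically: a, arrived, damaged, delivery, fire, gold, in, of, shipment, silver, truck) and one column per document:

```r
# Term-document matrix: rows are the 11 terms (alphabetical order),
# columns are the 3 documents
a <- matrix(c(1, 1, 1,   # a
              0, 1, 1,   # arrived
              1, 0, 0,   # damaged
              0, 1, 0,   # delivery
              1, 0, 0,   # fire
              1, 0, 1,   # gold
              1, 1, 1,   # in
              1, 1, 1,   # of
              1, 0, 1,   # shipment
              0, 2, 0,   # silver
              0, 1, 1),  # truck
            nrow = 11, byrow = TRUE)
```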
mysvd <- svd(a)
(u <- mysvd$u)
            [,1]        [,2]        [,3]
 [1,] -0.4201216 -0.07479925 -0.04597244
 [2,] -0.2994868  0.20009226  0.40782766
 [3,] -0.1206348 -0.27489151 -0.45380010
 [4,] -0.1575610  0.30464762 -0.20064670
 [5,] -0.1206348 -0.27489151 -0.45380010
 [6,] -0.2625606 -0.37944687  0.15467426
 [7,] -0.4201216 -0.07479925 -0.04597244
 [8,] -0.4201216 -0.07479925 -0.04597244
 [9,] -0.2625606 -0.37944687  0.15467426
[10,] -0.3151220  0.60929523 -0.40129339
[11,] -0.2994868  0.20009226  0.40782766
(v <- mysvd$v)
           [,1]       [,2]       [,3]
[1,] -0.4944666 -0.6491758 -0.5779910
[2,] -0.6458224  0.7194469 -0.2555574
[3,] -0.5817355 -0.2469149  0.7749947
(s <- diag(mysvd$d))
         [,1]     [,2]     [,3]
[1,] 4.098872 0.000000 0.000000
[2,] 0.000000 2.361571 0.000000
[3,] 0.000000 0.000000 1.273669
u %*% s %*% t(v)
              [,1]          [,2]          [,3]
 [1,] 1.000000e+00  1.000000e+00  1.000000e+00
 [2,] 1.387779e-15  1.000000e+00  1.000000e+00
 [3,] 1.000000e+00 -4.996004e-16  1.720846e-15
 [4,] 6.383782e-16  1.000000e+00 -9.992007e-16
 [5,] 1.000000e+00 -4.718448e-16  1.554312e-15
 [6,] 1.000000e+00 -8.534840e-16  1.000000e+00
 [7,] 1.000000e+00  1.000000e+00  1.000000e+00
 [8,] 1.000000e+00  1.000000e+00  1.000000e+00
 [9,] 1.000000e+00 -8.534840e-16  1.000000e+00
[10,] 1.276756e-15  2.000000e+00 -1.998401e-15
[11,] 1.276756e-15  1.000000e+00  1.000000e+00

As discussed earlier, U is our term-topic matrix and V is our document-topic matrix. If we want to look at two topics, we keep the first two columns of U and V and the first two rows and columns of S.

(u <- u[,1:2])
            [,1]        [,2]
 [1,] -0.4201216 -0.07479925
 [2,] -0.2994868  0.20009226
 [3,] -0.1206348 -0.27489151
 [4,] -0.1575610  0.30464762
 [5,] -0.1206348 -0.27489151
 [6,] -0.2625606 -0.37944687
 [7,] -0.4201216 -0.07479925
 [8,] -0.4201216 -0.07479925
 [9,] -0.2625606 -0.37944687
[10,] -0.3151220  0.60929523
[11,] -0.2994868  0.20009226
(v <- v[,1:2])
           [,1]       [,2]
[1,] -0.4944666 -0.6491758
[2,] -0.6458224  0.7194469
[3,] -0.5817355 -0.2469149
(s <- s[1:2,1:2])
         [,1]     [,2]
[1,] 4.098872 0.000000
[2,] 0.000000 2.361571

Let’s look for the query “gold silver truck” using the following formula (18).

\[ \hat{q} = q^{T}U_tS_t^{-1} \]

(q <- matrix(c(0,0,0,0,0,1,0,0,0,1,1), 11, 1,byrow=TRUE))
      [,1]
 [1,]    0
 [2,]    0
 [3,]    0
 [4,]    0
 [5,]    0
 [6,]    1
 [7,]    0
 [8,]    0
 [9,]    0
[10,]    1
[11,]    1
(q <- as.vector(t(q) %*% u %*% solve(s)))
[1] -0.2140026  0.1820571

We now have the coordinates of the query and the coordinates of each of the documents.

(doc1 <- v[1,])
[1] -0.4944666 -0.6491758
(doc2 <- v[2,])
[1] -0.6458224  0.7194469
(doc3 <- v[3,])
[1] -0.5817355 -0.2469149

We now use the same process as before to find the most relevant document.

(doc1 %*% q)/(sqrt(sum(doc1^2))*sqrt(sum(q^2)))
            [,1]
[1,] -0.05395084
(doc2 %*% q)/(sqrt(sum(doc2^2))*sqrt(sum(q^2)))
          [,1]
[1,] 0.9909874
(doc3 %*% q)/(sqrt(sum(doc3^2))*sqrt(sum(q^2)))
          [,1]
[1,] 0.4479595

Document two would be the top result for this query. This can also be shown visually.

Coordinates (19)

pLSA Model

Model

The PLSA model was meant to improve upon LSA by adding probabilistic concepts. The model revolves around two main assumptions. Topic z is present in document d with probability \(P(z|d)\) and word w is present in topic z with \(P(w|z)\). The joint probability of seeing a document d and word w together is shown below (20).

\[ P(D,W) = P(D)\sum_z{P(Z|D)P(W|Z)} = \sum_z{P(Z)P(D|Z)P(W|Z)} \]

The terms on the right-hand side are the parameters of the pLSA model. While \(P(D)\) can be estimated directly from the corpus, \(P(Z|D)\) and \(P(W|Z)\) are modeled as multinomial distributions and are estimated using a process called expectation-maximization (EM). The first formulation is called the asymmetric formulation and the other is the symmetric formulation; the second is perfectly symmetric in entities, documents and words (21). The difference is that in the first formulation we start with the document and generate the topic and word with some probability. In the second, we start with a topic and then generate the document and the word. With the second formulation, there is a direct connection to LSA.

Connection between LSA and PLSA (22)

This connection makes clear that the only difference between LSA and pLSA, as expected, is the inclusion of probabilistic concepts. The \(P(W|Z)\) terms relate to \(U\) (the term space) and the \(P(D|Z)\) terms relate to \(V\) (the document space).
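The EM algorithm mentioned above alternates between two updates; for the asymmetric formulation they can be sketched as follows, where \(n(d,w)\) denotes the count of word \(w\) in document \(d\) (these are the standard pLSA updates):

\[ \text{E-step:} \quad P(z|d,w) = \frac{P(z|d)\,P(w|z)}{\sum_{z'} P(z'|d)\,P(w|z')} \]

\[ \text{M-step:} \quad P(w|z) \propto \sum_{d} n(d,w)\,P(z|d,w), \qquad P(z|d) \propto \sum_{w} n(d,w)\,P(z|d,w) \]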

There are few openly available worked examples using pLSA; people interested in topic modeling tend to gravitate toward LSA or LDA.

Latent Dirichlet Allocation (LDA)

LDA is a more recent (and more popular) method for topic modeling, compared to VSM, LSA, and pLSA. The main difference between pLSA and LDA is the incorporation of Bayesian concepts. LDA treats each document as a mixture of topics and each topic as a mixture of words, estimating both mixtures at the same time to find the mixture of topics that best describes each document.

LDA broken down

LDA is great at producing easily understandable output. For example, if we search an ESPN database we might find that Topic 1 is best represented by “NFL, Super, Bowl, football, coach, quarterback” and Topic 2 is represented by “NBA, LeBron, Steph, Warriors, coach.” If a new ESPN article is published, we could find the topic mixture for that article based on the information learned from our corpus. The ability to quickly interpret new articles is the main advantage over pLSA (23). Those working with LDA use software to produce quick results; there are multiple packages that allow you to run LDA in R, two of the main ones being lda and topicmodels.

Model

We consider that each document is made up of a number of topics and each topic is made up of a number of words. Integral to LDA is the use of Dirichlet priors for the document-topic and word-topic distributions. The Dirichlet distribution (after Peter Gustav Lejeune Dirichlet), denoted \(\operatorname{Dir}(\boldsymbol{\alpha})\), is a family of continuous multivariate probability distributions parameterized by a vector of concentration parameters \(\alpha_i \in \mathbb{R}^+,\; \forall i\in \mathbb{N}\). It is a multivariate generalization of the beta distribution (and is sometimes called the multivariate beta distribution). Dirichlet distributions are often used as prior distributions in Bayesian statistics because they are conjugate to the categorical and multinomial distributions. The Dirichlet distribution is often called a “distribution of distributions” (24). The general process is outlined in the figure below.

LDA Process (25)
(26)

As in other topic modeling methods, we choose \(k\) to be the number of topics we want to represent each document in our corpus. However, in LDA there are the additional parameters \(\alpha\) and \(\beta\). While \(\alpha\) relates to the prior weight of topic \(k\) in a document, the other parameter \(\beta\) relates to the prior weight of word \(w\) in a topic. We usually set these to very low values such as 0.1 or 0.01 because we expect there to be few words per topic and few topics per document (27).
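The effect of these concentration parameters can be illustrated by sampling: draws from a Dirichlet are themselves probability vectors, and low values of \(\alpha\) produce sparse, near-one-hot mixtures. A base-R sketch using the standard Gamma-normalization construction (the rdirichlet() helper below is our own, not from a package):

```r
# Draw n samples from Dir(alpha): sample g_i ~ Gamma(alpha_i, 1)
# for each component, then normalize each row to sum to 1
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k, byrow = TRUE)
  g / rowSums(g)
}

set.seed(1)
# Low alpha: each row (e.g., a document's topic mixture) concentrates
# its mass on very few topics
round(rdirichlet(3, rep(0.1, 5)), 2)
# Higher alpha: mixtures are much closer to uniform
round(rdirichlet(3, rep(10, 5)), 2)
```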

Pros and Cons

Features and Limitations (28)

A “Simple” Example

This example examines the topics associated with tweets posted to Twitter. Specifically, we want to look at tweets with the following properties:

  • Include a specific search term or hashtag
  • Were posted within a specified time
  • Were posted by users in and around a specified region of the United States

If you want to do this yourself, you’ll need to create your own Twitter Developer account and get issued the required set of API keys and tokens. Then you can substitute your own tokens in the functions below to make your own requests to the Twitter API.

In the chunk below I submit my personal tokens to set up OAuth authorization with the Twitter API (I’m letting Twitter know who is making the requests).

# You'll need to get your own keys by 
# creating a Twitter developer account
twitter_consumer_key    <- jkf::key_chain('twitter.api')
twitter_consumer_secret <- jkf::key_chain('twitter.api.secret')
twitter_access_token    <- jkf::key_chain('twitter.token')
twitter_access_secret   <- jkf::key_chain('twitter.token.secret')

twitteR::setup_twitter_oauth(consumer_key    = twitter_consumer_key,
                             consumer_secret = twitter_consumer_secret,
                             access_token    = twitter_access_token,
                             access_secret   = twitter_access_secret)

The code in the next chunk is the workhorse function that gets the Tweets we want and returns them in a data.frame for easy analysis.

get_tweets <- function(text = '#trump',
                       location = NULL,
                       dist = NULL,
                       units = NULL,
                       map_source = "google",
                       key = jkf::key_chain("gmapsAPI"), ...) {
  
  lat_long_dist <- NULL
  locale <- NULL
  
  if(!(is.null(location) | missing(location))) {
    
    # Geocode the location string to a latitude/longitude pair
    lat_long <- ggmap::geocode(location,
                               source = map_source,
                               key = key)
    
    # Default to a 20-mile radius if none is supplied
    if(is.null(dist)  | missing(dist))  dist  <- '20'
    if(is.null(units) | missing(units)) units <- 'mi'
    
    lat_long_dist <- glue::glue("{lat_long[2]},{lat_long[1]},{dist}{units}")
    
    locale <- 'ja'
    
  }
  
  tweets <- twitteR::searchTwitter(searchString = text,
                                   geocode = lat_long_dist,
                                   locale = locale, ...)
  
  # Convert the list of tweets to a data.frame
  tweet_df <- twitteR::twListToDF(tweets)
  
  return(tweet_df)
  
}

Then we use the function to get the tweets.

Tweets <- get_tweets(text = '#trump',
                     location = "Dayton, OH",
                     dist = "50",
                     units = "mi",
                     map_source = "google")

DT::datatable(Tweets)

Data Prep

As you can see, there are three columns and 650 observations (tweets). Next we’ll transform the data to a tidy data structure and list the most-used words, excluding stop words.

Tidy

Tidy_Tweets <- Tweets %>%
  tidytext::unnest_tokens(word,text) %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::add_count(word, sort = TRUE)
Tidy_Tweets
dfm_Tweets <- Tidy_Tweets %>%
  tidytext::cast_dfm(id, word, n)
dfm_Tweets

Topic Model Using STM

topic_model <- stm::stm(dfm_Tweets, K = 4, init.type = "LDA")

summary(topic_model)

Visualization

topic_model_td <- tidytext::tidy(topic_model)

We’ll use ggplot2 to plot beta, the per-topic word probabilities, to see which words are contributing the most to which topic.

topic_model_td %>%
  group_by(topic) %>%
  top_n(7) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

topic_model_td
PUR_dfm <- quanteda::dfm(PUR_corpus,
                         remove = quanteda::stopwords("english"),
                         stem = FALSE,
                         remove_punct = TRUE)

PUR_lda <- topicmodels::LDA(quanteda::convert(PUR_dfm, to = "topicmodels"), k = 10)

topicmodels::get_terms(PUR_lda, 5)
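Beyond the top terms per topic, each document’s estimated topic mixture is available through posterior(). A self-contained sketch using the AssociatedPress document-term matrix bundled with topicmodels (PUR_lda above can be inspected the same way):

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Fit a small model on the first 50 documents (seed fixed for repeatability)
fit <- LDA(AssociatedPress[1:50, ], k = 4, control = list(seed = 1234))

# gamma: one row per document, one column per topic; each row sums to 1
gamma <- posterior(fit)$topics
round(head(gamma, 3), 2)

# Top 5 terms per topic, as with get_terms() above
terms(fit, 5)
```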

Let’s clean up our data by removing the dominant brand words (“samsung”, “edge”, “phone”, “galaxy”) that appear in nearly every review.

PUR_en_edge_2015$extract <- qdap::mgsub(pattern = c("samsung","edge","phone","galaxy"), 
                                        replacement = "",
                                        text.var = tolower(PUR_en_edge_2015$extract))
PUR_corpus <- quanteda::corpus(PUR_en_edge_2015$extract)
quanteda::docvars(PUR_corpus, "date")    <- PUR_en_edge_2015$date
quanteda::docvars(PUR_corpus, "score")   <- PUR_en_edge_2015$score
quanteda::docvars(PUR_corpus, "source")  <- PUR_en_edge_2015$source
quanteda::docvars(PUR_corpus, "domain")  <- PUR_en_edge_2015$domain
quanteda::docvars(PUR_corpus, "product") <- PUR_en_edge_2015$product
quanteda::docvars(PUR_corpus, "country") <- PUR_en_edge_2015$country

DT::datatable(summary(PUR_corpus))
Warning in nsentence.character(object, ...): nsentence() does not correctly
count sentences in all lower-cased text
PUR_dfm <- quanteda::dfm(PUR_corpus,
                         remove = quanteda::stopwords("english"),
                         stem = FALSE,
                         remove_punct = TRUE)

PUR_lda <- topicmodels::LDA(quanteda::convert(PUR_dfm, to = "topicmodels"), k = 10)

topicmodels::get_terms(PUR_lda, 5)